Handle errors from cluster_status #3735

clumens · 2024-11-15T15:56:16Z

I'm a little undecided on this patch at the moment - check out the changes to regression test output in the last patch. I think some of that could be mitigated by not printing the warnings at all. The errors are a little trickier. We want to print them out if we're not verbose (that's the entire point of the related issue) but that means we get all of them.

kgaillot

Only had time for first commit so far

lib/pacemaker/pcmk_scheduler.c

Use pcmk_unpack_scheduler_input instead.

… fails. This function can return a couple of error codes, most notably when called on input with a feature set that is newer than the latest supported. In that case, the caller should return its own error instead of trying to continue on with an unpopulated scheduler object. This prevents a cascade of error messages.
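A minimal sketch of the "stop on unpack failure" pattern that this commit message describes. Everything here is a hypothetical stand-in (`struct sketch_scheduler`, `sketch_unpack`, `sketch_run`, and the `SKETCH_*` constants); none of it is the real Pacemaker API or its return codes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; these are NOT real Pacemaker definitions. */
enum sketch_rc {
    SKETCH_RC_OK = 0,
    SKETCH_RC_TOO_NEW = 1       /* input feature set newer than supported */
};

#define SKETCH_MAX_FEATURE_SET 3 /* assumed "latest supported" value */

struct sketch_scheduler {
    int input_feature_set;      /* feature set declared by the input */
    bool populated;             /* set only when unpacking succeeds */
};

/* Stand-in for the unpack step: rejects too-new input and leaves the
 * scheduler object unpopulated in that case.
 */
static int
sketch_unpack(struct sketch_scheduler *scheduler)
{
    if (scheduler->input_feature_set > SKETCH_MAX_FEATURE_SET) {
        return SKETCH_RC_TOO_NEW;
    }
    scheduler->populated = true;
    return SKETCH_RC_OK;
}

/* Caller: return the error right away instead of carrying on with an
 * unpopulated object. steps_run counts work that needs populated data.
 */
static int
sketch_run(struct sketch_scheduler *scheduler, int *steps_run)
{
    int rc = sketch_unpack(scheduler);

    if (rc != SKETCH_RC_OK) {
        /* No extra reporting here: the unpack step is assumed to have
         * logged the problem already.
         */
        return rc;
    }
    (*steps_run)++;
    return SKETCH_RC_OK;
}
```

Returning early keeps the later steps from ever seeing an empty scheduler object, which is where the cascade of follow-on errors would otherwise come from.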

Also, there's no need to do any error reporting. pcmk__config_err will have already called crm_err in this case.

The error message is hidden and only gets displayed if -V is given on the command line. Adding config error/warning handlers will cause the error to be displayed regardless. This could have been implemented in a couple of ways, and there are tradeoffs here. I've chosen to duplicate what's happening in crm_verify, but instead of checking for verbosity (which is a global variable in that file), I'm checking out->is_quiet. This means that if you do `crm_simulate -Q`, you won't see the error message but you will get an error return code. This also means that with `crm_simulate -Q -VVVV...`, you still won't see the error message. This may be a bug, but I'm not sure who would do that and I also think these sorts of problems are pervasive in our command line tools.

Fix T521
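A rough sketch of a config error handler gated on quietness, as described above. `struct sketch_output`, `sketch_is_quiet`, and `sketch_config_error` are hypothetical stand-ins, not the real output object or handler. The message text in the usage note is made up, and the sketch formats into a fixed buffer with `vsnprintf` rather than allocating, purely to stay portable.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for an output object; NOT the real Pacemaker type. */
struct sketch_output {
    bool quiet;                 /* assumed to be set by a -Q style option */
    int errors_shown;           /* how many messages were actually printed */
    char last[256];             /* most recently formatted message */
    bool (*is_quiet)(struct sketch_output *out);
};

static bool
sketch_is_quiet(struct sketch_output *out)
{
    return out->quiet;
}

/* printf-style handler: the message is always formatted, but it is only
 * shown when the output object is not quiet. Verbosity is never consulted,
 * so quiet wins even if verbose flags were also given.
 */
static void
sketch_config_error(void *ctx, const char *msg, ...)
{
    struct sketch_output *out = (struct sketch_output *) ctx;
    va_list ap;

    va_start(ap, msg);
    vsnprintf(out->last, sizeof(out->last), msg, ap);
    va_end(ap);

    if (!out->is_quiet(out)) {
        fprintf(stderr, "error: %s\n", out->last);
        out->errors_shown++;
    }
}
```

With `quiet` set, a call such as `sketch_config_error(&out, "feature set %s is too new", "9.9")` prints nothing; the caller would still be expected to return an error code, which matches the `-Q` behavior described in the commit message.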

This is just like the previous patch for crm_simulate, complete with all the same problems regarding -Q and -V.

This is just like the previous patch to crm_simulate. However, one additional problem here is that it relies on using the deprecated -Q command line option. On the other hand, I think this is okay because we have a lot of work to do straightening out these sorts of options for all our command line tools. This is just one more thing we'll have to deal with at that time.

This takes care of all callers of pcmk__output_cluster_status and pcmk__status. pcmk_status would also be affected, but at the moment there are no users of that function and anyway the config error handlers aren't public API.

The point of this is to allow it to return the value from unpack_cib, which is returning the value from cluster_status. This allows us to check whether that function hit the too-new feature set CIB condition.

…tions. This takes care of most callers - the ones in the daemons are unlikely to be a problem. This allows catching the too-new schema condition in various other tools and displaying an error message to the user. Note that a couple other callers don't need to check the return value. I've added comments explaining why.

* Remove the leading function name from various messages. This was most commonly "unpack_resources".
* In the XML output format, move various messages from text that gets printed out to the XML output itself. This does end up with somewhat weird output with status="0" message="OK" followed by some error messages.
* Add a couple warnings to crm_resource output.

kgaillot

I'm leaning toward keeping the messages in crm_simulate output; they are issues that the user needs to know about

lib/pengine/status.c

include/crm/pengine/status_compat.h

lib/pacemaker/pcmk_simulate.c

lib/pengine/status.c

kgaillot · 2024-12-03T18:28:54Z

tools/crm_ticket.c

+
+    va_start(ap, msg);
+    pcmk__assert(vasprintf(&buf, msg, ap) > 0);
+    if (!out->is_quiet(out)) {


might as well just un-deprecate -Q

kgaillot · 2024-12-03T18:40:59Z

cts/cli/regression.crm_resource.exp

@@ -1730,6 +1757,7 @@ WARNING: Creating rsc_location constraint 'cli-ban-dummy-on-node1' with a score
 =#=#=#= End test: Move a resource from its existing location - OK (0) =#=#=#=
 * Passed: crm_resource          - Move a resource from its existing location
 =#=#=#= Begin test: Clear out constraints generated by --move =#=#=#=
+warning: More than one node entry has name 'node1'


Is this message accurate?

Doesn't look like it, though I don't yet see what could be causing that.

cts/scheduler/summary/bug-lf-1852.summary

kgaillot · 2024-12-03T18:48:53Z

cts/scheduler/summary/failcount-block.summary

@@ -1,3 +1,6 @@
+error: Ignoring invalid node_state entry without id
+warning: Ignoring failure timeout (10s) for rsc_pcmk-2 because it conflicts with on-fail=block
+warning: Ignoring failure timeout (10s) for rsc_pcmk-4 because it conflicts with on-fail=block


Most of the errors/warnings in the test cases are things we really should fix in the test case :(

(some may be testing the error though)

cts/scheduler/summary/leftover-pending-monitor.summary

cts/scheduler/summary/order-wrong-kind.summary

kgaillot reviewed Nov 18, 2024

lib/pacemaker/pcmk_scheduler.c

clumens added 11 commits November 18, 2024 17:04

API: scheduler: Add pcmk_unpack_scheduler_input.

a7e1531

API: scheduler: Deprecate cluster_status.

c0cc157

Use pcmk_unpack_scheduler_input instead.

Refactor: daemons: Don't continue if pcmk_unpack_scheduler_input fails.

20e97f0

Also, there's no need to do any error reporting. pcmk__config_err will have already called crm_err in this case.

Low: tools: crm_resource should error on too-new feature sets.

6d1ecf6

This is just like the previous patch for crm_simulate, complete with all the same problems regarding -Q and -V.

Low: tools: crm_mon should error on too-new feature sets.

0e63dd2

This takes care of all callers of pcmk__output_cluster_status and pcmk__status. pcmk_status would also be affected, but at the moment there are no users of that function and anyway the config error handlers aren't public API.

Refactor: libpacemaker: pcmk__schedule_actions should return a value.

52c9dce

The point of this is to allow it to return the value from unpack_cib, which is returning the value from cluster_status. This allows us to check whether that function hit the too-new feature set CIB condition.

clumens force-pushed the cluster_status-errors branch from 07e8301 to b4f4b75 on November 19, 2024 17:18

kgaillot reviewed Dec 3, 2024
